Covid-19 Data Analysis Project Overview

Introduction

We were challenged by our organization to make sense of a set of publicly available data surrounding the spread of the Covid-19 novel coronavirus. Presented with the data, we were required to ask meaningful questions revolving around certain metrics and factors that potentially played a part in the spread of Covid-19, as well as draw insightful conclusions based on any analytical techniques or visualization methodologies we would use on this set of data.

Questions

We started off the analysis by visually going over the data. What possible questions and what fundamental conclusions could we draw from the data headings provided? Through preliminary discussion we generated the following questions as a baseline for guidance and direction of our analysis.

  1. Are specific geographies affected by Covid-19 any worse than others?
  2. Does the rate of infection differ between countries?
  3. Is income class a good indicator of access to healthcare and in turn a variable that affects survival rates of the virus?
  4. Can we potentially predict the future rate of infection or future state of confirmed cases worldwide based on this data?

Assumptions

  • Cumulative values carry over from the last non-blank data point:
    • Under the cumulative values column, there were several points of data for some entities that were blank even though there were values above. Given that the values are supposed to be cumulative according to the column header, any blank data point was populated with the value of the last non-blank data point.
  • Cases per million and deaths per million columns had data that was already normalized
    • Explanation: The assumption is that since some countries have a population of less than one million, the “per million” data was taken as a ratio of cases or deaths per the population, then extrapolated and normalized to a scale of one million

Tools used

We used the Jupyter Notebook platform to process the provided data with Python libraries such as numpy, pandas, and plotly. The thought process and flow of analysis is documented below.

Data Wrangling

We'll start off by importing the Python libraries we will need to perform this analysis.

In [1]:
# import required Python libraries
from datetime import datetime, timedelta
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px

Creating dataframe

Load the csv file into a Pandas dataframe.

In [2]:
# Read csv dataset into a Pandas dataframe
covid19_data = pd.read_csv(r'C:\Users\tayj2\Desktop\Untitled Folder\covid-19-data.csv')
In [3]:
# Create a tuple representing the dimensionality of our data
rows, cols = covid19_data.shape

# Print the number of columns and rows
print('Columns: {} Rows: {}'.format(cols, rows, ))
Columns: 9 Rows: 12381

As we can see, our dataset includes 9 columns and 12,381 rows of data.

Let's have a look at what columns of data are included.

In [4]:
# Display all column headers
covid19_data.columns
Out[4]:
Index(['Entity', 'Code', 'Date', 'Cumulative tests',
       'Cumulative tests per million', 'Total confirmed cases (cases)',
       'Confirmed cases per million (cases per million)',
       'Total confirmed deaths (deaths)',
       'Confirmed deaths per million (deaths per million)'],
      dtype='object')

Some of these column names seem self explanatory. We would expect to see a date in the 'Date' column and the last 6 columns look to be numbers representing quantative data points.

However, there are two columns, Entity and Code, that we can not immeadiatly identfy what the values in those columns represent.

Let's first take a look at what types values are in the Entity column.

In [5]:
# Display all unique values in the Entity column
covid19_data.Entity.unique()
Out[5]:
array(['Afghanistan', 'Africa', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Anguilla', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba',
       'Asia', 'Asia excl. China', 'Australia', 'Austria', 'Azerbaijan',
       'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus',
       'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia',
       'Bonaire Sint Eustatius and Saba', 'Bosnia and Herzegovina',
       'Botswana', 'Brazil', 'British Virgin Islands', 'Brunei',
       'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon',
       'Canada', 'Cape Verde', 'Cayman Islands',
       'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia',
       'Congo', 'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Cuba',
       'Curacao', 'Cyprus', 'Czech Republic',
       'Democratic Republic of Congo', 'Denmark', 'Djibouti', 'Dominica',
       'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador',
       'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Europe',
       'European Union', 'Faeroe Islands', 'Falkland Islands', 'Fiji',
       'Finland', 'France', 'French Polynesia', 'Gabon', 'Gambia',
       'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland',
       'Grenada', 'Guam', 'Guatemala', 'Guernsey', 'Guinea',
       'Guinea-Bissau', 'Guyana', 'Haiti', 'High income', 'Honduras',
       'Hungary', 'Iceland', 'India', 'India, people tested', 'Indonesia',
       'International', 'Iran', 'Iraq', 'Ireland', 'Isle of Man',
       'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan',
       'Kazakhstan', 'Kenya', 'Kosovo', 'Kuwait', 'Kyrgyzstan', 'Laos',
       'Latvia', 'Lebanon', 'Liberia', 'Libya', 'Liechtenstein',
       'Lithuania', 'Low income', 'Lower middle income', 'Luxembourg',
       'Macedonia', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives',
       'Mali', 'Malta', 'Mauritania', 'Mauritius', 'Mexico', 'Moldova',
       'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco',
       'Mozambique', 'Myanmar', 'Namibia', 'Nepal', 'Netherlands',
       'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria',
       'North America', 'Northern Mariana Islands', 'Norway', 'Oceania',
       'Oman', 'Pakistan', 'Palestine', 'Panama', 'Papua New Guinea',
       'Paraguay', 'Peru', 'Philippines', 'Poland', 'Portugal',
       'Puerto Rico', 'Qatar', 'Romania', 'Russia', 'Rwanda',
       'Saint Kitts and Nevis', 'Saint Lucia',
       'Saint Vincent and the Grenadines', 'San Marino',
       'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia',
       'Seychelles', 'Sierra Leone', 'Singapore',
       'Singapore, swabs tested', 'Sint Maarten (Dutch part)', 'Slovakia',
       'Slovenia', 'Somalia', 'South Africa', 'South America',
       'South Korea', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan',
       'Suriname', 'Swaziland', 'Sweden', 'Switzerland', 'Syria',
       'Taiwan', 'Tanzania', 'Thailand', 'Timor', 'Togo',
       'Trinidad and Tobago', 'Tunisia', 'Turkey',
       'Turks and Caicos Islands', 'Uganda', 'Ukraine',
       'United Arab Emirates', 'United Kingdom', 'United States',
       'United States Virgin Islands',
       'United States, specimens tested (CDC)', 'Upper middle income',
       'Uruguay', 'Uzbekistan', 'Vatican', 'Venezuela', 'Vietnam',
       'World', 'World excl. China', 'World excl. China and South Korea',
       'World excl. China, South Korea, Japan and Singapore', 'Yemen',
       'Zambia', 'Zimbabwe'], dtype=object)

So, the Entity column contatins a lot of unique values that appear to mostly represents country data. However, there are values that do not represent a country. There are four categories of values that represent data other than country specific:

  • Continent data (North America, South America, Europe, etc)
  • World data, including World data the excludes certain countires (World excl. China, World excl. China, South Korea, Japan and Singapore, etc)
  • Data that seems to represent individual testing data for specific areas (India, people tested, Singapore, swabs tested, United States, specimens tested (CDC))
  • Data related to income levels (Low income, Lower middle income, Upper middle income, High income)

Now let's see what values the Code column has.

In [6]:
# Display all unique values in the Code column
covid19_data.Code.unique()
Out[6]:
array(['AFG', nan, 'ALB', 'DZA', 'AND', 'AGO', 'AIA', 'ATG', 'ARG', 'ARM',
       'ABW', 'AUS', 'AUT', 'AZE', 'BHS', 'BHR', 'BGD', 'BRB', 'BLR',
       'BEL', 'BLZ', 'BEN', 'BMU', 'BTN', 'BOL', 'BES', 'BIH', 'BWA',
       'BRA', 'VGB', 'BRN', 'BGR', 'BFA', 'BDI', 'KHM', 'CMR', 'CAN',
       'CPV', 'CYM', 'CAF', 'TCD', 'CHL', 'CHN', 'COL', 'COG', 'CRI',
       'CIV', 'HRV', 'CUB', 'CUW', 'CYP', 'CZE', 'COD', 'DNK', 'DJI',
       'DMA', 'DOM', 'ECU', 'EGY', 'SLV', 'GNQ', 'ERI', 'EST', 'ETH',
       'FRO', 'FLK', 'FJI', 'FIN', 'FRA', 'PYF', 'GAB', 'GMB', 'GEO',
       'DEU', 'GHA', 'GIB', 'GRC', 'GRL', 'GRD', 'GUM', 'GTM', 'GGY',
       'GIN', 'GNB', 'GUY', 'HTI', 'HND', 'HUN', 'ISL', 'IND', 'IDN',
       'IRN', 'IRQ', 'IRL', 'IMN', 'ISR', 'ITA', 'JAM', 'JPN', 'JEY',
       'JOR', 'KAZ', 'KEN', 'OWID_KOS', 'KWT', 'KGZ', 'LAO', 'LVA', 'LBN',
       'LBR', 'LBY', 'LIE', 'LTU', 'LUX', 'MKD', 'MDG', 'MWI', 'MYS',
       'MDV', 'MLI', 'MLT', 'MRT', 'MUS', 'MEX', 'MDA', 'MCO', 'MNG',
       'MNE', 'MSR', 'MAR', 'MOZ', 'MMR', 'NAM', 'NPL', 'NLD', 'NCL',
       'NZL', 'NIC', 'NER', 'NGA', 'MNP', 'NOR', 'OMN', 'PAK', 'PSE',
       'PAN', 'PNG', 'PRY', 'PER', 'PHL', 'POL', 'PRT', 'PRI', 'QAT',
       'ROU', 'RUS', 'RWA', 'KNA', 'LCA', 'VCT', 'SMR', 'STP', 'SAU',
       'SEN', 'SRB', 'SYC', 'SLE', 'SGP', 'SXM', 'SVK', 'SVN', 'SOM',
       'ZAF', 'KOR', 'SSD', 'ESP', 'LKA', 'SDN', 'SUR', 'SWZ', 'SWE',
       'CHE', 'SYR', 'TWN', 'TZA', 'THA', 'TLS', 'TGO', 'TTO', 'TUN',
       'TUR', 'TCA', 'UGA', 'UKR', 'ARE', 'GBR', 'USA', 'VIR', 'URY',
       'UZB', 'VAT', 'VEN', 'VNM', 'OWID_WRL', 'YEM', 'ZMB', 'ZWE'],
      dtype=object)
In [7]:
nan_code = covid19_data[covid19_data['Code'].isna()]
nan_code.Entity.unique()
Out[7]:
array(['Africa', 'Asia', 'Asia excl. China', 'Europe', 'European Union',
       'High income', 'India, people tested', 'International',
       'Low income', 'Lower middle income', 'North America', 'Oceania',
       'Singapore, swabs tested', 'South America',
       'United States, specimens tested (CDC)', 'Upper middle income',
       'World excl. China', 'World excl. China and South Korea',
       'World excl. China, South Korea, Japan and Singapore'],
      dtype=object)
In [8]:
# Display first 5 rows
covid19_data.head()
Out[8]:
Entity Code Date Cumulative tests Cumulative tests per million Total confirmed cases (cases) Confirmed cases per million (cases per million) Total confirmed deaths (deaths) Confirmed deaths per million (deaths per million)
0 Afghanistan AFG Dec 31, 2019 NaN NaN 0.0 0.0 0.0 0.0
1 Afghanistan AFG Jan 1, 2020 NaN NaN 0.0 0.0 0.0 0.0
2 Afghanistan AFG Jan 2, 2020 NaN NaN 0.0 0.0 0.0 0.0
3 Afghanistan AFG Jan 3, 2020 NaN NaN 0.0 0.0 0.0 0.0
4 Afghanistan AFG Jan 4, 2020 NaN NaN 0.0 0.0 0.0 0.0
In [9]:
# Display last 5 rows
covid19_data.tail()
Out[9]:
Entity Code Date Cumulative tests Cumulative tests per million Total confirmed cases (cases) Confirmed cases per million (cases per million) Total confirmed deaths (deaths) Confirmed deaths per million (deaths per million)
12376 Zimbabwe ZWE Apr 9, 2020 NaN NaN 11.0 0.740096 2.0 0.134563
12377 Zimbabwe ZWE Apr 10, 2020 NaN NaN 11.0 0.740096 3.0 0.201844
12378 Zimbabwe ZWE Apr 11, 2020 NaN NaN 11.0 0.740096 3.0 0.201844
12379 Zimbabwe ZWE Apr 12, 2020 NaN NaN 14.0 0.941941 3.0 0.201844
12380 Zimbabwe ZWE Apr 13, 2020 NaN NaN 14.0 0.941941 3.0 0.201844

After reviewing the head and tail of the dataset there are a couple interseting observations:

  • We have NaN values where we would prefer to see number data. This may be due to the fact that no data was available so no value was recorded. In order to aid us with our analysis, for columns we would expect to see a number value, we will replace NaN with 0.0 to ensure our dataset is consistent
  • Some rows have no data for the tests, cases, and deaths data. We will go ahead and remove those rows in order to make the dataset a bit more manageable since it should not skew our data during analysis.

We will take care of these during the data cleaning process.

After a cursory review of the data in Excel, there is one more issue with the data set we need to address. Since the columns should represent a cumulative value from day to day, we should expect to see a value equal to or greater than the value from the previous day.

However, if we take a look at Austrailia for example, there are missing values. For example, on 3-Apr-20 there were 277,278 tests recorded, but on 4-Apr-20 there is no value recorded. We should expect to see at least 277,278 if no new tests had been recorded.

We will fill out the missing values in the dataset during the data cleaning process.

In [10]:
aus_data = covid19_data[(covid19_data['Code'] == 'AUS') & 
                        ((covid19_data['Date'] == '12-Apr-20') | 
                         (covid19_data['Date'] == '13-Apr-20'))]
display(aus_data)
Entity Code Date Cumulative tests Cumulative tests per million Total confirmed cases (cases) Confirmed cases per million (cases per million) Total confirmed deaths (deaths) Confirmed deaths per million (deaths per million)

Data Cleaning

While exploring our dataset we found three areas were we can help improve our anaylsis by cleaning up the data:

  1. Replace all NaN values with 0.0 for the columns that we expect to see number data
  2. Fill out the number series were values appear to be missing (data on a specfic date, but no data on the following date)
  3. Remove rows that have no test, case, and death data
In [11]:
covid19_data.head()
Out[11]:
Entity Code Date Cumulative tests Cumulative tests per million Total confirmed cases (cases) Confirmed cases per million (cases per million) Total confirmed deaths (deaths) Confirmed deaths per million (deaths per million)
0 Afghanistan AFG Dec 31, 2019 NaN NaN 0.0 0.0 0.0 0.0
1 Afghanistan AFG Jan 1, 2020 NaN NaN 0.0 0.0 0.0 0.0
2 Afghanistan AFG Jan 2, 2020 NaN NaN 0.0 0.0 0.0 0.0
3 Afghanistan AFG Jan 3, 2020 NaN NaN 0.0 0.0 0.0 0.0
4 Afghanistan AFG Jan 4, 2020 NaN NaN 0.0 0.0 0.0 0.0
In [12]:
# Create a list containing column names we want to replace NAN
fix_nan = ['Cumulative tests', 
           'Cumulative tests per million', 
           'Total confirmed cases (cases)', 
           'Confirmed cases per million (cases per million)', 
           'Total confirmed deaths (deaths)', 
           'Confirmed deaths per million (deaths per million)']

# loop through the list of column names and replace NAN with 0.0
for column in fix_nan:
    covid19_data[column] = covid19_data[column].fillna(0.0)
    
covid19_data.head()
Out[12]:
Entity Code Date Cumulative tests Cumulative tests per million Total confirmed cases (cases) Confirmed cases per million (cases per million) Total confirmed deaths (deaths) Confirmed deaths per million (deaths per million)
0 Afghanistan AFG Dec 31, 2019 0.0 0.0 0.0 0.0 0.0 0.0
1 Afghanistan AFG Jan 1, 2020 0.0 0.0 0.0 0.0 0.0 0.0
2 Afghanistan AFG Jan 2, 2020 0.0 0.0 0.0 0.0 0.0 0.0
3 Afghanistan AFG Jan 3, 2020 0.0 0.0 0.0 0.0 0.0 0.0
4 Afghanistan AFG Jan 4, 2020 0.0 0.0 0.0 0.0 0.0 0.0
In [13]:
# Create new empty dataframe with the same columns from covid19_data
df_columns = covid19_data.columns
covid19_data_complete = pd.DataFrame(columns=df_columns)

# Create a list of all unique entries from covid19_data
entities = covid19_data.Entity.unique()

#create list of columns we want to fill out the data series
columns = ['Cumulative tests', 
           'Cumulative tests per million', 
           'Total confirmed cases (cases)', 
           'Confirmed cases per million (cases per million)', 
           'Total confirmed deaths (deaths)', 
           'Confirmed deaths per million (deaths per million)']

# Loop through each unique entity, create a temporary data frame and loop through each column and row.
# If the value in the column is 0.0 update the value to the value in the previous row.
# Append the temporary dataframe to the new covid_data_complete dataframe.
for entity in entities:
    temp_df = covid19_data[covid19_data['Entity'] == entity].reset_index(drop=True)
    for i in range (1, len(temp_df)):
        for column in columns:
            if temp_df.loc[i, column] == 0.0:
                temp_df.loc[i, column] = temp_df.loc[i-1, column]
    covid19_data_complete = covid19_data_complete.append(temp_df, ignore_index=True)
In [14]:
aus_data = covid19_data_complete[(covid19_data_complete['Code'] == 'AUS') & 
                        ((covid19_data_complete['Date'] == '12-Apr-20') | 
                         (covid19_data_complete['Date'] == '13-Apr-20'))]
display(aus_data)
Entity Code Date Cumulative tests Cumulative tests per million Total confirmed cases (cases) Confirmed cases per million (cases per million) Total confirmed deaths (deaths) Confirmed deaths per million (deaths per million)
In [15]:
# Remove rows that have no test, cases, and death data and create a new dataframe
covid19_data_filtered = covid19_data_complete.drop(
    covid19_data_complete[(covid19_data_complete['Cumulative tests'] == 0.0) &
                                    (covid19_data_complete['Cumulative tests per million'] == 0.0) &
                                    (covid19_data_complete['Cumulative tests per million'] == 0.0) &
                                    (covid19_data_complete['Total confirmed cases (cases)'] == 0.0) &
                                    (covid19_data_complete['Confirmed cases per million (cases per million)'] == 0.0) &
                                    (covid19_data_complete['Total confirmed deaths (deaths)'] == 0.0) &
                                    (covid19_data_complete['Confirmed deaths per million (deaths per million)'] == 0.0)
                                    ].index)
covid19_data_filtered.head()
Out[15]:
Entity Code Date Cumulative tests Cumulative tests per million Total confirmed cases (cases) Confirmed cases per million (cases per million) Total confirmed deaths (deaths) Confirmed deaths per million (deaths per million)
56 Afghanistan AFG Feb 25, 2020 0.0 0.0 1.0 0.025688 0.0 0.0
57 Afghanistan AFG Feb 26, 2020 0.0 0.0 1.0 0.025688 0.0 0.0
58 Afghanistan AFG Feb 27, 2020 0.0 0.0 1.0 0.025688 0.0 0.0
59 Afghanistan AFG Feb 28, 2020 0.0 0.0 1.0 0.025688 0.0 0.0
60 Afghanistan AFG Feb 29, 2020 0.0 0.0 1.0 0.025688 0.0 0.0
In [16]:
# Convert the Date column to the date format YYYY-MM-DD
covid19_data_filtered['Date']= pd.to_datetime(covid19_data_filtered['Date'])
covid19_data_filtered['Date'] = covid19_data_filtered['Date'].apply(lambda x: x.strftime('%Y-%m-%d'))

Exploratory Data Analysis

Question 1: Are specific geographies affected by Covid-19 any worse than others?

Visualizing the spread through a dynamic choropleth will make it easier for viewers to understand the spread of Covid-19 across the world over the time period of 4 months based on the data provided. This section covers the code used to create these visualizations.

In [17]:
# Create dateframe containing only Countries
by_country_data = covid19_data_filtered[covid19_data_filtered['Code'].notna()]

# Remove 'World' from the Coutnries dataframe
by_country_data = by_country_data[by_country_data.Entity != 'World']

by_country_data = by_country_data.sort_values('Date')

display(by_country_data)
Entity Code Date Cumulative tests Cumulative tests per million Total confirmed cases (cases) Confirmed cases per million (cases per million) Total confirmed deaths (deaths) Confirmed deaths per million (deaths per million)
2301 China CHN 2019-12-31 0.0 0.0 27.0 0.018759 0.0 0.000000
2302 China CHN 2020-01-01 0.0 0.0 27.0 0.018759 0.0 0.000000
2303 China CHN 2020-01-02 0.0 0.0 27.0 0.018759 0.0 0.000000
2304 China CHN 2020-01-03 0.0 0.0 44.0 0.030570 0.0 0.000000
2305 China CHN 2020-01-04 0.0 0.0 44.0 0.030570 0.0 0.000000
... ... ... ... ... ... ... ... ... ...
4626 Guinea-Bissau GNB 2020-04-13 0.0 0.0 39.0 19.817093 0.0 0.000000
4656 Guyana GUY 2020-04-13 0.0 0.0 45.0 57.211220 6.0 7.628163
4681 Haiti HTI 2020-04-13 0.0 0.0 33.0 2.894094 3.0 0.263099
4475 Greenland GRL 2020-04-13 0.0 0.0 11.0 193.757486 0.0 0.000000
12380 Zimbabwe ZWE 2020-04-13 0.0 0.0 14.0 0.941941 3.0 0.201844

7649 rows × 9 columns

In [18]:
most_recent = by_country_data[by_country_data['Date'] == '2020-04-13'].reset_index(drop=True)
display(most_recent)
Entity Code Date Cumulative tests Cumulative tests per million Total confirmed cases (cases) Confirmed cases per million (cases per million) Total confirmed deaths (deaths) Confirmed deaths per million (deaths per million)
0 Portugal PRT 2020-04-13 0.0 0.000000 16585.0 1626.505498 504.0 49.427722
1 Argentina ARG 2020-04-13 18027.0 396.107272 2208.0 48.854122 95.0 2.101966
2 Sierra Leone SLE 2020-04-13 0.0 0.000000 10.0 1.253606 0.0 0.000000
3 Puerto Rico PRI 2020-04-13 0.0 0.000000 897.0 313.544274 44.0 15.380098
4 Palestine PSE 2020-04-13 0.0 0.000000 268.0 52.534434 2.0 0.392048
... ... ... ... ... ... ... ... ... ...
200 Guinea-Bissau GNB 2020-04-13 0.0 0.000000 39.0 19.817093 0.0 0.000000
201 Guyana GUY 2020-04-13 0.0 0.000000 45.0 57.211220 6.0 7.628163
202 Haiti HTI 2020-04-13 0.0 0.000000 33.0 2.894094 3.0 0.263099
203 Greenland GRL 2020-04-13 0.0 0.000000 11.0 193.757486 0.0 0.000000
204 Zimbabwe ZWE 2020-04-13 0.0 0.000000 14.0 0.941941 3.0 0.201844

205 rows × 9 columns

In [19]:
# Drop rows prior to 2020-01-20. Data before this date is to small to see on the scatter map.
by_country_scattergeo = by_country_data.drop(by_country_data[(by_country_data['Date'] < '2020-01-20')].index)

fig = px.scatter_geo(by_country_scattergeo, locations='Code',
                     size = 'Total confirmed cases (cases)',
                     animation_frame = 'Date',
                     size_max = 100,
                     opacity = .5,
                     title = 'Total confirmed cases by country',
                    )
fig.add_trace(go.Choropleth(
                locations = most_recent['Code'],
                z = most_recent['Total confirmed cases (cases)'],
                colorscale = 'Reds',
                ))
fig.update_layout(transition = {'duration': 100})


fig.show()
In [20]:
fig = px.scatter_geo(by_country_scattergeo, locations='Code',
                     size = 'Confirmed cases per million (cases per million)',
                     animation_frame = 'Date',
                     size_max = 100,
                     opacity = .5,
                     title = 'Confirmed cases per million by country',
                    )
fig.add_trace(go.Choropleth(
                locations = most_recent['Code'],
                z = most_recent['Confirmed cases per million (cases per million)'],
                colorscale = 'Reds',
                ))
fig.update_layout(transition = {'duration': 100})


fig.show()

Finding 1

The visualizations show us that although some countries have higher total cases confirmed, this does not equate to severity due to the fact that countries have different population sizes from each other. The second visualization for cases per million by country, shows normalized data that accounts for this fact and presents the data in a much more accurate way, showing precisely which countries were hit harder by the spread of Covid-19 with respect to population size.

Question 2: Does the rate of infection differ between countries?

We used a racing bar chart to attempt to depict visually how Covid-19 spread across the world over the time period of 4 months. The bar charts show the top 10 countries over time with respect to total number of confirmed cases. This section shows the code we used to try to answer this question.

In [21]:
day = by_country_data['Date'].min()

by_country_data.sort_values(by=['Date', 'Total confirmed cases (cases)'], ascending=[True, False], inplace=True)

days = by_country_data['Date'].unique()

countries = by_country_data['Entity'].unique()

bar_colors = {'China':'rgb(59, 23, 219)', 
          'United States':'rgb(21, 188, 145)',
          'Thailand':'rgb(194, 26, 221)',
          'Japan':'rgb(116, 237, 137)',
          'South Korea':'rgb(168, 163, 174)',
          'Taiwan':'rgb(176, 172, 46)',
          'Singapore':'rgb(119, 68, 60)',
          'Vietnam':'rgb(245, 102, 89)',
          'Malaysia':'rgb(218, 102, 210)',
          'France':'rgb(139, 169, 131)',
          'Australia':'rgb(236, 21, 239)',
          'Germany':'rgb(180, 197, 157)',
          'Iran':'rgb(222, 128, 94)',
          'Italy':'rgb(73, 90, 242)',
          'Bahrain':'rgb(16, 69, 83)',
          'Kuwait':'rgb(86, 225, 80)',
          'Spain':'rgb(213, 108, 52)',
          'Switzerland':'rgb(44, 140, 111)',
          'Denmark':'rgb(107, 163, 213)',
          'United Kingdom':'rgb(73, 21, 203)',
          'Netherlands':'rgb(183, 43, 223)',
          'Belgium':'rgb(122, 249, 136)',
          'Turkey':'rgb(28, 225, 69)',
         }


fig_dict = {
    'data': [],
    'layout': {},
    'frames': []
}

# Create the layout
fig_dict['layout']['title'] ='Total confirmed cases by country - Top 10 (log)'
fig_dict['layout']['xaxis'] = {'title': 'Country'}
fig_dict['layout']['yaxis'] = {'title': 'Total confirmed cases (log)', 'type': 'log'}
fig_dict['layout']['showlegend'] = False
fig_dict['layout']['hovermode'] = 'closest'
fig_dict['layout']['updatemenus'] = [
    {
        'buttons': [
            {
                'args': [None, {'frame': {'duration': 500, 'redraw': True},
                                'fromcurrent': True, 'transition': {'duration': 300,
                                                                    'easing': 'quadratic-in-out'}}],
                'label': 'Play',
                'method': 'animate'
            },
            {
                'args': [[None], {'frame': {'duration': 0, 'redraw': False},
                                  'mode': 'immediate',
                                  'transition': {'duration': 0}}],
                'label': 'Pause',
                'method': 'animate'
            }
        ],
        'direction': 'left',
        'pad': {'r': 10, 't': 87},
        'showactive': False,
        'type': 'buttons',
        'x': 0.1,
        'xanchor': 'right',
        'y': 0,
        'yanchor': 'top'
    }
]

sliders_dict = {
    'active': 0,
    'yanchor': 'top',
    'xanchor': 'left',
    'currentvalue': {
        'font': {'size': 20},
        'prefix': 'Date:',
        'visible': True,
        'xanchor': 'right'
    },
    'transition': {'duration': 300, 'easing': 'linear'},
    'pad': {'b': 10, 't': 50},
    'len': 0.9,
    'x': 0.1,
    'y': 0,
    'steps': []
}


# Create inital data

for country in countries:
    data_by_day = by_country_data[by_country_data['Date'] == day]
    data_by_day = data_by_day.nlargest(10,'Total confirmed cases (cases)')
    data_by_day_and_country = data_by_day[data_by_day['Entity'] == country]
    data_dict = {
        'type': 'bar',
        'x': list(data_by_day_and_country['Entity']),
        'y': list(data_by_day_and_country['Total confirmed cases (cases)']),
        'name': country,
        'marker_color': bar_colors['China'],
        }
        
    fig_dict['data'].append(data_dict)
    
# Make frames for animation
for day in days:
    data_by_day = by_country_data[by_country_data['Date'] == day]
    data_by_day = data_by_day.nlargest(10,'Total confirmed cases (cases)')
    data_by_day.sort_values(by='Total confirmed cases (cases)', ascending= False, inplace=True)
    countries = data_by_day['Entity'].unique()
    frame = {'data': [], 'name': day}
    for country in countries:
        #data_by_day = by_country_data[by_country_data['Date'] == day]
        #data_by_day = data_by_day.nlargest(10,'Total confirmed cases (cases)')
        data_by_day_and_country = data_by_day[data_by_day['Entity'] == country]
        data_dict = {
            'type': 'bar',
            'x': list(data_by_day_and_country['Entity']),
            'y': list(data_by_day_and_country['Total confirmed cases (cases)']),
            'name': country,
            'marker_color': bar_colors[country]
        }
        frame['data'].append(data_dict)
        
    fig_dict['frames'].append(frame)
    slider_step = {'args': [
        [day],
        {'frame': {'duration': 1000, 'redraw': False},
         'mode': 'immediate',
         'transition': {"duration": 300}}
    ],
        'label': day,
        'method': 'animate'}
    sliders_dict['steps'].append(slider_step)


fig_dict['layout']['sliders'] = [sliders_dict]
fig = go.Figure(fig_dict)  
fig.show()

Finding 2

From the visualization it becomes apparent that China was the leader for number of cases across the world, however towards the end of the data set showing the first 4 months of the spread the United States took over as the most affected country by Covid-19 by a wide margin.

Question 3: Is income class a good indicator of access to healthcare and in turn a variable that affects survival rates of the virus?

This question revolves around trying to figure out if there are any major factors we can retrieve from our data that affect the spread of and survival rate of Covid-19. This section below shows the code we used to try to answer this question.

In [22]:
# pull out income data into it's own subset data frame, then each income category
income_data = covid19_data[(covid19_data['Entity'] == 'High income') | (covid19_data['Entity'] == 'Low income')| (covid19_data['Entity'] == 'Lower middle income') |(covid19_data['Entity'] == 'Upper middle income')]


highincome = income_data[income_data['Entity']=="High income"]

lowincome = income_data[income_data['Entity']=="Low income"]

lmincome = income_data[income_data['Entity']=="Lower middle income"]

umincome = income_data[income_data['Entity']=="Upper middle income"]
In [23]:
# Plot the income data over time using the total confirmed cases as the y axis value.
fig = go.Figure()
fig.add_trace(go.Scatter(x=highincome.Date, y=highincome["Total confirmed cases (cases)"], name='High Income',
                         line=dict(color='rgb(0,128,255)', width=4)))
fig.add_trace(go.Scatter(x=umincome.Date, y=umincome["Total confirmed cases (cases)"], name='Upper Middle Income',
                         line=dict(color='rgb(153,229,255)', width=4)))
fig.add_trace(go.Scatter(x=lmincome.Date, y=lmincome["Total confirmed cases (cases)"], name='Lower Middle Income',
                         line=dict(color='rgb(255,153,153)', width=4)))
fig.add_trace(go.Scatter(x=lowincome.Date, y=lowincome["Total confirmed cases (cases)"], name='Low Income',
                         line=dict(color='rgb(255,0,0)', width=4)))

fig.update_layout(
    title="Total confirmed cases of Covid-19 of income classes from Jan 2020 to Apr 2020",
    xaxis_title="Date",
    yaxis_title="Total number of cases of Covid-19",
    template="plotly_dark",
    font=dict(
        family="arial",
        size=18,
    )
    
)
In [24]:
# Plot the income data over time using the total deaths as the y axis value.
fig = go.Figure()
fig.add_trace(go.Scatter(x=highincome.Date, y=highincome["Total confirmed deaths (deaths)"], name='High Income',
                         line=dict(color='rgb(0,128,255)', width=4, dash="dot")))
fig.add_trace(go.Scatter(x=umincome.Date, y=umincome["Total confirmed deaths (deaths)"], name='Upper Middle Income',
                         line=dict(color='rgb(153,229,255)', width=4, dash="dot")))
fig.add_trace(go.Scatter(x=lmincome.Date, y=lmincome["Total confirmed deaths (deaths)"], name='Lower Middle Income',
                         line=dict(color='rgb(255,153,153)', width=4, dash="dot")))
fig.add_trace(go.Scatter(x=lowincome.Date, y=lowincome["Total confirmed deaths (deaths)"], name='Low Income',
                         line=dict(color='rgb(255,0,0)', width=4, dash="dot")))

fig.update_layout(
    title="Total confirmed deaths from Covid-19 of income classes from Jan 2020 to Apr 2020",
    xaxis_title="Date",
    yaxis_title="Confirmed deaths from Covid-19",
    template="plotly_dark",
    font=dict(
        family="arial",
        size=18,
    )
    
)
In [25]:
# Plot the ratio of the total confirmed deaths against the total confirmed cases as the y axis value.
fig = go.Figure()
fig.add_trace(go.Scatter(x=highincome.Date, y=(highincome["Total confirmed deaths (deaths)"]/highincome["Total confirmed cases (cases)"]), name='High Income',
                         line=dict(color='rgb(0,128,255)', width=4, dash="dot")))
fig.add_trace(go.Scatter(x=umincome.Date, y=(umincome["Total confirmed deaths (deaths)"]/umincome["Total confirmed cases (cases)"]), name='Upper Middle Income',
                         line=dict(color='rgb(153,229,255)', width=4, dash="dot")))
fig.add_trace(go.Scatter(x=lmincome.Date, y=(lmincome["Total confirmed deaths (deaths)"]/lmincome["Total confirmed cases (cases)"]), name='Lower Middle Income',
                         line=dict(color='rgb(255,153,153)', width=4, dash="dot")))
fig.add_trace(go.Scatter(x=lowincome.Date, y=(lowincome["Total confirmed deaths (deaths)"]/lowincome["Total confirmed cases (cases)"]), name='Low Income',
                         line=dict(color='rgb(255,0,0)', width=4, dash="dot")))

fig.update_layout(
    title="Ratio of deaths to cases of Covid-19 from each income class over time",
    xaxis_title="Date",
    yaxis_title="Deaths per Case from Covid-19",
    template="plotly_dark",
    font=dict(
        family="arial",
        size=18,
    )
    
)

Finding 3

  • The first 2 charts show similar trends within each income class with respect to cases and deaths of Covid-19
    • These charts may not be depicting an accurate representation of true numbers for several reasons:
      • People in lower income classes may have limited access to test kits, therefore showing a lower confirmed case count than higher income classes
      • The same holds true for confirmed deaths, as those numbers only apply when the death is related to a confirmed case, which may be limited in lower income classes for the aforementioned reason
  • Instead, the 3rd chart shows a ratio of the number of deaths to the number of confirmed cases over time per income class
    • This depicts a more accurate representation of how income class impacts confirmed cases and the survival rate of each
    • From the chart we see that the Upper middle, Lower middle and low income classes at the end of the data set given start to stabilize at the same level. Additionally, contrary to common understanding that people of high income class have better access to healthcare, we actually observe that they have a higher death count amongst confirmed cases.

The conclusion that can be drawn is that at the end of time period of the data given, there is very little difference in the number of deaths per case between income classes. The biggest outlier was the high income category with a 6% mortality rate compared to the others that hovered around 4%.

Question 4: Can we potentially predict the future rate of infection or future state of confirmed cases worldwide based on this data?

In [26]:
# pull out world data into it's own subset data frame
world_data = covid19_data[(covid19_data['Code'] == 'OWID_WRL')]

display(world_data)
Entity Code Date Cumulative tests Cumulative tests per million Total confirmed cases (cases) Confirmed cases per million (cases per million) Total confirmed deaths (deaths) Confirmed deaths per million (deaths per million)
11907 World OWID_WRL Dec 31, 2019 0.0 0.0 27.0 0.003464 0.0 0.000000
11908 World OWID_WRL Jan 1, 2020 0.0 0.0 27.0 0.003464 0.0 0.000000
11909 World OWID_WRL Jan 2, 2020 0.0 0.0 27.0 0.003464 0.0 0.000000
11910 World OWID_WRL Jan 3, 2020 0.0 0.0 44.0 0.005645 0.0 0.000000
11911 World OWID_WRL Jan 4, 2020 0.0 0.0 44.0 0.005645 0.0 0.000000
... ... ... ... ... ... ... ... ... ...
12007 World OWID_WRL Apr 9, 2020 0.0 0.0 1476818.0 189.461980 87816.0 11.265974
12008 World OWID_WRL Apr 10, 2020 0.0 0.0 1563857.0 200.628272 95039.0 12.192618
12009 World OWID_WRL Apr 11, 2020 0.0 0.0 1653206.0 212.090916 102088.0 13.096939
12010 World OWID_WRL Apr 12, 2020 0.0 0.0 1734913.0 222.573162 108192.0 13.880025
12011 World OWID_WRL Apr 13, 2020 0.0 0.0 1807308.0 231.860765 113513.0 14.562660

105 rows × 9 columns

In [29]:
w_rows, w_cols = world_data.shape
w_list = list(range(0, w_rows))
w_dat = world_data["Total confirmed cases (cases)"]

# calculate polynomial regression
z = np.polyfit(w_list, w_dat, 6)
f = np.poly1d(z)

#calculate best fit 
y_new = f(w_list)
ybar = np.sum(w_dat)/len(w_dat)
ssreg = np.sum((y_new-ybar)**2)
sstot = np.sum((w_dat - ybar)**2)
r_squared = ssreg/sstot
print("The R-Squared value of the polynomial regression line to the 6th Degree is ", r_squared)


#plot the scatter plot of the world data
polyfit = go.Scatter(
                  x=world_data.Date,
                  y=y_new,
                  mode='lines',
                  marker=go.Marker(color='rgb(31, 119, 90)'),
                  name='Trendline'
                  )

worlddat = go.Scatter(x=world_data.Date, y=(world_data["Total confirmed cases (cases)"]), 
                          name='World', 
                          mode="markers",
                          marker=dict(color='red', size=3),
                          line=dict(color='rgb(0,128,255)', width=4, )
                     )

fig = go.Figure()

fig.add_trace(worlddat)

fig.add_trace(polyfit)


fig.update_layout(
    title="Confirmed cases worldwide over time",
    xaxis_title="Date",
    yaxis_title="Confirmed cases of Covid-19",
    template="plotly_dark",
    font=dict(family="arial",size=18),
    annotations=
        [dict(
            x=50,
            y=700000,
            text="-5.81e-05 X^6 - 0.01825X^5  - 2.057X^4 + 104.2X^3 - 2328X^2 + 1.956e+04X - 3.461e+04",
            showarrow=False,
            font=dict(family="arial",size=12, color='rgb(31, 119, 90)' ),
        )]
    
)
The R-Squared value of the polynomial regression line to the 6th Degree is  0.9988003325876159

Finding 4

  • The trendline utilized a polynomial function to an order of 6, with an R-Squared value of 99.88%
  • Having a high R-square value means that this trendline can predict with high accuracy given the current amount of data what those numbers could look like in the future.
    • There are limitations to using this model to predict case numbers for dates too far out in the future, given that there are many variables and other indicators not covered in this set of data that will affect the actual number that will occur.
    • Even though there are limitations, by collecting more data over time, there will be more opportunity to modify and improve the accuracy of the model with the inclusion of other variables that may impact the actual number of cases worldwide.

Conclusion

The data set provided gives us a good foundation for visualizing the spread of Covid-19. The ability to see how the virus spread across the world throughout the timescale provides some insight to the current globalization of our society and the ease of which people can move about the world. It would appear that for the most part, death rates of the virus are indiscriminate of income levels.

  • Limitations:
    • The data does not provide any additional factors that may impact how we view the contagiousness of the virus (e.g. recovery rate, availability of test kits per country, etc)
    • Even though this set of data was used to form a predictive model for potentially foretelling the number of cases in a future state, it should be primarily used for historical analysis instead of predictive, given the lack of external factors that could impact the conclusions we are drawing from the data.